Changhua Sun, IBM Research - China, schangh@cn.ibm.com [PRIMARY contact]
Weishan Dong, IBM Research - China, dongweis@cn.ibm.com
Peter Bak, IBM Research - Haifa, peter.bak@il.ibm.com
Harold-Jeffrey Ship, IBM Research - Haifa, harold@il.ibm.com
Lei Shi, IBM Research - China, shllsh@cn.ibm.com
Heng Cao, IBM Research - China, hengcao@cn.ibm.com
Zhong Su, IBM Research - China, suzhong@cn.ibm.com
We use the ArcObjects SDK 10 for Java (ArcGIS Engine 10) to build a standalone application that visualizes the geospatial-temporal data. We use the SDK to create shapefiles, filter GIS features by spatial location or attributes, and process data based on spatial relationships. We also use ArcGIS Desktop 10 (ArcMap) to convert “Vastopolis_Map.png” into shapefiles for zones (polygon), the river and lakes (polygon), and hospitals, stadiums and city administrations (point). ArcMap is also used to create a map document (.mxd) for our standalone application. In addition, the IBM Spatiotemporal Visual Analytics Workbench, developed by IBM Research - Haifa, Israel, is used for advanced color mapping and temporal filtering.
We use the VisWorks Peony visualization framework, developed by members of the Smart Visual Analytics team at IBM Research - China between 2007 and 2011, to draw line charts that reveal patterns in temporal distributions.
We use Mallet to preprocess the microblog text entries and extract relevant topics and keywords.
To provide quantitative analysis results and reasoning, we use the Generalized Spatial Association Rule (GSAR) mining tool, developed by the Spatial Analytics & Applications team at IBM Research - China between 2010 and 2011, to mine spatiotemporal association rules from the data.
Video:
ANSWERS:
MC 1.1 Origin and Epidemic Spread: Identify approximately where the
outbreak started on the map (ground zero location). If possible, outline the
affected area. Explain how you arrived at your conclusion.
The epidemic outbreak started on 5/18 around 8:00am at three locations: the Vastopolis Dome, north of Vastopolis City Hospital, and southwest of the Convention Center. The affected areas cover all Vastopolis zones, with Downtown, Uptown, Eastside, and Smogtown the hardest hit.
We arrive at this answer by:
• Data preprocessing: apply text analytics on microblogs to extract flu-like topics and keywords
• Visualization:
  ◦ Visualize the temporal trend of the number of microblogs (Figure 1) and narrow down the outbreak range
  ◦ Visualize the spatiotemporal distribution of flu-like microblogs on the cartographic map. Render the zones by flu report rate normalized by population (Figure 2)
• Visual Exploration:
  ◦ Compare the flu-like microblog distributions before and after the outbreak. Verify with the association rules
  ◦ Explore the temporal spread of the epidemic with the flu report rate
Figure 1 Spatiotemporal distribution of flu-like microblogs around the outbreak
Figure 2 Spatiotemporal visualization of flu-like microblogs with flu report rate
MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is
being transmitted. For example, is the method of transmission person-to-person,
airborne, waterborne, or something else? Identify the trends that support your
hypothesis. Is the outbreak contained? Is it necessary for emergency management
personnel to deploy treatment resources outside the affected area? Explain your
reasoning.
A. Hypothesis of the transmission
The infection is being transmitted person-to-person, by air, and by water. The epidemic was carried by the west wind to Eastside, and was also carried by the Vast River to Smogtown and Plainville. In addition, it spread person-to-person on public transportation and in crowded city centers.
On 5/20, many people with flu-like symptoms go to hospitals, but many others still do not. Thus, we suggest that emergency management personnel deploy treatment resources and take measures such as notifying the City of Vast River downstream and warning people, especially in Cornertown, Villa and Southville, to avoid public places such as the Vastopolis Dome and the Convention Center.
B. Analytics Processes
Figure 3 schematically depicts the analytic pipeline, which consists of three closely related and highly iterative parts: preprocessing, automatic analysis and mining, and visualization. Each part is described in the following subsections.
Figure 3 Pipeline of our analytics process
B.1 Data Pre-Processing
We input the 1M+ microblogs into Mallet and use the Latent Dirichlet Allocation method to train topics from the corpus. The number of topics is set to 10, and from these we manually select 3 topics related to the epidemic. Then, from the keywords of these 3 topics (each topic has 50 keywords), we further select 34 flu-like keywords, such as “flu”, “chill”, “pneumonia”, “stomach”, “diarrhea”, “fatigue”, “sweat”, and “nausea”. These two manual steps take approximately 30 minutes initially and another 30 minutes later on to add or remove keywords based on analytics results. Finally, we scan the microblogs to match these flu-like keywords against their contents. Each microblog is then associated with 34 keyword tags, each indicating whether the corresponding keyword is present in the microblog.
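The tagging step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tag_microblog` is hypothetical, only the eight keywords quoted above are listed (the full set of 34 is not given in the text), and whole-word, case-insensitive matching is an assumption.

```python
import re

# Illustrative subset: only the keywords quoted in the text, not the full 34.
FLU_KEYWORDS = ["flu", "chill", "pneumonia", "stomach",
                "diarrhea", "fatigue", "sweat", "nausea"]


def tag_microblog(text, keywords=FLU_KEYWORDS):
    """Return {keyword: bool} marking which keywords occur in the text.

    Assumption: a keyword counts as present only if it appears as a whole
    lowercase word; the original matching rule is not specified.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {kw: kw in tokens for kw in keywords}


tags = tag_microblog("Got the flu and my stomach hurts")
```

Whole-word matching avoids false hits such as "fluent" for "flu", at the cost of missing inflected forms such as "chills"; a stemmer or prefix match would trade those errors the other way.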
We use ArcMap to manually convert “Vastopolis_Map.png” into zone, river, lake, hospital, stadium and city administration shapefiles. This takes 2 hours.
We use the ArcObjects SDK to create a microblog point shapefile with keyword attributes in 30 minutes. We use the SDK to identify the zone for each microblog point, and then compute the number of persons moving between two zones (movements). Two microblogs written by the same author on the same day, but at different times and in two different zones, increase the movement count between these zones by one. We convert the movements into a line shapefile. This whole process takes 30 minutes.
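The movement rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' ArcObjects implementation: the function name `count_movements` and the record layout are hypothetical, only consecutive posts of an author within a day are compared, and movements are treated as undirected.

```python
from collections import Counter, defaultdict
from datetime import datetime


def count_movements(posts):
    """Count movements between zones.

    posts: iterable of (author_id, timestamp, zone) with datetime timestamps.
    Returns a Counter keyed by an unordered zone pair (frozenset).
    Assumptions: only consecutive posts by the same author on the same day
    are compared, and direction of travel is ignored.
    """
    by_author_day = defaultdict(list)
    for author, ts, zone in posts:
        by_author_day[(author, ts.date())].append((ts, zone))

    movements = Counter()
    for entries in by_author_day.values():
        entries.sort(key=lambda e: e[0])  # order each author's day by time
        for (_, z1), (_, z2) in zip(entries, entries[1:]):
            if z1 != z2:
                movements[frozenset((z1, z2))] += 1
    return movements


sample = [
    ("a", datetime(2011, 5, 18, 8, 0), "Downtown"),
    ("a", datetime(2011, 5, 18, 12, 0), "Uptown"),
    ("b", datetime(2011, 5, 18, 9, 0), "Downtown"),
    ("b", datetime(2011, 5, 18, 10, 0), "Downtown"),
]
movements = count_movements(sample)
```

Each resulting zone pair and its count would then become one line feature whose width encodes the count.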
B.2 Association rule mining
The GSAR mining tool computes two spatial relationships, “close to” and “within”, between each microblog record and all the other spatial objects, including public buildings, the river, lakes, and zones. The “within” relationship is defined as topological containment. The “close to” relationship is true if the geographical distance between a microblog record and a spatial object is less than 1 km.
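As a rough illustration of the “close to” test for point objects, the sketch below uses the haversine great-circle distance. The GSAR tool's actual distance computation is not described in the text, so the function names, the latitude/longitude input format, and the choice of haversine are all assumptions; distances to lines or polygons such as the river would need a geometry library.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def close_to(record, landmark, threshold_km=1.0):
    """True if a (lat, lon) record lies within threshold_km of a landmark.

    The 1 km default follows the definition in the text; the coordinates
    passed in are assumed to be (latitude, longitude) in degrees.
    """
    return haversine_km(*record, *landmark) < threshold_km


# Hypothetical coordinates, for illustration only.
near = close_to((42.0, -93.0), (42.005, -93.0))
far = close_to((42.0, -93.0), (42.02, -93.0))
```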
The rules take the form A => B (s, c), where A and B are combinations of temporal attributes, spatial attributes, and keywords; s is the support of the rule, the number of records that satisfy both A and B; and c is the confidence of the rule, the conditional probability P(B|A) = P(AB)/P(A). A rule can be read as: if A happened, then B happened, with support s and confidence c.
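The two measures can be computed directly from their definitions. The sketch below is illustrative only: `rule_stats` is a hypothetical helper, records are modeled as sets of attribute items, and the sample records are made up rather than taken from the data set.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    records: list of sets of items (temporal, spatial, keyword attributes).
    support: number of records containing both A and B, as in the text.
    confidence: P(B|A) = count(A and B) / count(A); 0.0 if A never occurs.
    """
    n_a = sum(1 for r in records if antecedent <= r)
    n_ab = sum(1 for r in records if (antecedent | consequent) <= r)
    confidence = n_ab / n_a if n_a else 0.0
    return n_ab, confidence


# Made-up records for illustration.
records = [
    {"diarrhea", "5/20", "close_to_river"},
    {"diarrhea", "5/20"},
    {"diarrhea", "5/19", "close_to_river"},
    {"flu", "5/18"},
]
support, confidence = rule_stats(records, {"diarrhea"}, {"5/20"})
```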
B.3 Visualization
We use the ArcObjects SDK to develop a standalone application with flexible time range control and flu-like keyword filtering. We also render each zone by its flu report rate, which is the number of authors reporting flu in the selected time range, normalized by the daytime population of the zone.
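The flu report rate can be sketched as below. This is a minimal illustration assuming a hypothetical record layout and function name; counting each author at most once per zone and using a half-open time range are assumptions, and the population figures in the example are invented.

```python
from datetime import datetime


def flu_report_rate(posts, daytime_population, start: datetime, end: datetime):
    """Flu report rate per zone for the time range [start, end).

    posts: iterable of (author_id, timestamp, zone, has_flu_keyword).
    daytime_population: {zone: population}; zones with population 0 are skipped.
    Assumption: an author is counted once per zone, however often they post.
    """
    authors = {}
    for author, ts, zone, has_flu in posts:
        if has_flu and start <= ts < end:
            authors.setdefault(zone, set()).add(author)
    return {zone: len(authors.get(zone, ())) / pop
            for zone, pop in daytime_population.items() if pop > 0}


# Invented numbers, for illustration only.
rates = flu_report_rate(
    [("a", datetime(2011, 5, 18, 9, 0), "Downtown", True),
     ("a", datetime(2011, 5, 18, 11, 0), "Downtown", True),
     ("b", datetime(2011, 5, 18, 10, 0), "Downtown", False)],
    {"Downtown": 1000, "Uptown": 500},
    datetime(2011, 5, 18), datetime(2011, 5, 19),
)
```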
In addition, we use spatial relationship queries to identify the authors who go to hospitals on 5/20, and then tag every microblog with whether its author goes to a hospital.
We integrate the Peony framework to draw line charts.
The visualization process takes us 3 hours.
B.4 Visual Reasoning
1. Person-to-person
Figure 4 visualizes people’s movements. The thickness of each line represents the number of movements. We see that Uptown, Suburbia, Northville, Westside, Plainville, Lakeside, and Eastside have large movements to and from Downtown.
Figure 2 visualizes the flu report rate and microblog points over time. It shows that some of the zones with large movements to and from Downtown also have a high flu report rate. Except for Smogtown, the spread to other zones may be caused by people’s movements.
Figure 4 Visualizing people’s movements indicated by the microblogs
2. Airborne
We visualize the flu report rate and the microblog points with flu keywords for 5/18-5/20, as shown in Figure 5. On 5/18, the epidemic spread from Downtown/Uptown to Eastside. This spread may be caused by the west wind on 5/18.
3. Waterborne
In Figure 5, on 5/19, the flu report rate for Smogtown is second only to that of Downtown. On 5/20, the flu report rate for Smogtown is the highest. The spread from Downtown/Uptown to Smogtown and Plainville may be caused by the Vast River, which flows south.
Figure 5 Visualizing the flu report rate for each zone on three consecutive days
To confirm our hypothesis, we use association rule mining to discover correlations among flu symptoms, time, and landmarks.
As shown in Figure 6, over 76% of the microblogs with “diarrhea” were reported on 5/20 and were located in the bottom-left zone area, close to the Vast River. This suggests that “diarrhea” spread along the Vast River. Similarly, rules indicating the outbreak on 5/18 within Downtown and Uptown can also be mined.
Figure 6 Visualizing association rule mining results. Some typical rules and the related microblog data points are highlighted on the map
4. Forecasting
As shown in Figure 7, we divide the flu-like keywords into two groups. For the first group, the number of microblogs decreases greatly or drops close to zero on 5/20, while for the second group it does the opposite.
The epidemic with symptoms in the first group can be considered contained. Though many people with symptoms of the second group go to hospitals on 5/20, there are still many people who don’t go to hospitals, especially in Smogtown, as with “diarrhea” in Figure 6.
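The text describes this grouping as a visual judgment on Figure 7. One way it could be automated is sketched below; the function name, the data layout, and especially the 20% drop threshold are assumptions introduced for illustration, not values from the analysis.

```python
def split_by_trend(daily_counts, last_day, drop_ratio=0.2):
    """Split keywords into (contained, ongoing) by their count on last_day.

    daily_counts: {keyword: {day_label: microblog_count}}.
    A keyword is 'contained' if its count on last_day is at most
    drop_ratio times its peak daily count. drop_ratio is a hypothetical
    threshold; the text does not define "decreases greatly" numerically.
    """
    contained, ongoing = [], []
    for keyword, counts in daily_counts.items():
        peak = max(counts.values(), default=0)
        last = counts.get(last_day, 0)
        if last <= drop_ratio * peak:
            contained.append(keyword)
        else:
            ongoing.append(keyword)
    return contained, ongoing


# Placeholder keywords and invented counts, for illustration only.
contained, ongoing = split_by_trend(
    {"keyword_a": {"5/18": 100, "5/19": 60, "5/20": 5},
     "keyword_b": {"5/18": 10, "5/19": 80, "5/20": 120}},
    last_day="5/20",
)
```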
Therefore, we suggest that emergency management personnel deploy treatment resources.
(a) Flu symptoms contained
(b) Flu symptoms not contained
Figure 7 Spatiotemporal distribution of microblogs for the two keyword groups